An evaluation of classification models for question topic categorization
Abstract
We study the problem of question topic classification using a very large real-world Community Question Answering (CQA) dataset from Yahoo! Answers. The dataset contains 3.9M questions, organized into a hierarchy of more than one thousand categories. To the best of our knowledge, this is the first systematic evaluation of different classification methods on question topic classification as well as on short texts. Specifically, we empirically evaluate the following in classifying questions into CQA categories: 1) the usefulness of n-gram features and bag-of-words features; 2) the performance of three standard classification algorithms (Naïve Bayes, Maximum Entropy, and Support Vector Machines); 3) the performance of state-of-the-art hierarchical classification algorithms; 4) the effect of training data size on performance; and 5) the effectiveness of the different components of CQA data, including subject, content, asker, and best answer. The experimental results show which aspects are important for question topic classification in terms of both effectiveness and efficiency. We believe that the findings from this study will be useful in real-world classification problems.
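As a rough illustration of the first two evaluation points (bag-of-words versus n-gram features, with Naïve Bayes as the classifier), the following is a minimal sketch in pure Python. It is not the paper's implementation: the helper `extract_features`, the class `NaiveBayes`, the four toy questions, and the two category labels are all hypothetical and chosen only to keep the example small.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, max_n=1):
    """Return word n-grams of length 1..max_n (max_n=1 is plain bag-of-words)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative only)."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.class_counts = Counter()               # documents per category
        self.feature_counts = defaultdict(Counter)  # feature counts per category
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in extract_features(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            denom = sum(self.feature_counts[label].values()) + len(self.vocab)
            for feat in extract_features(text, self.max_n):
                if feat in self.vocab:  # skip features never seen in training
                    score += math.log((self.feature_counts[label][feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data: hypothetical questions and categories, not from the dataset.
train_texts = [
    "how do i fix a flat bike tire",
    "best way to train for a marathon",
    "how do i install python on windows",
    "why does my laptop keep crashing",
]
train_labels = ["Sports", "Sports", "Computers", "Computers"]

unigram_model = NaiveBayes(max_n=1).fit(train_texts, train_labels)
bigram_model = NaiveBayes(max_n=2).fit(train_texts, train_labels)

for question in ["how to install python", "train for a bike marathon"]:
    print(question, "->", unigram_model.predict(question), "/", bigram_model.predict(question))
```

Setting `max_n=1` gives bag-of-words features and `max_n=2` adds word bigrams, so the same classifier can be trained on either feature set and compared on held-out questions. A real experiment would need a proper train/test split and a far larger sample than this toy set.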
Similar resources
Sparse Structured Principal Component Analysis and Model Learning for Classification and Quality Detection of Rice Grains
In scientific and commercial fields associated with modern agriculture, the categorization of different rice types and the determination of their quality are very important. Various image processing algorithms have been applied in recent years to detect different agricultural products. The problem of rice classification and quality detection in this paper is presented based on model learning concepts includ...
A Novel Approach To Focus Identification In Question/Answering Systems
Modern Question/Answering systems rely on expected answer types for processing questions. The answer type is a semantic category provided by a Named Entity recognizer or by semantic hierarchies. We argue in this paper that Q/A systems should take advantage of the topic information by exploiting several models of question and answer categorization. The matching of the question category with the an...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
An Empirical Comparison of Text Categorization Methods
In this paper we present a comprehensive comparison of the performance of a number of text categorization methods in two different data sets. In particular, we evaluate the Vector and Latent Semantic Analysis (LSA) methods, a classifier based on Support Vector Machines (SVM) and the k-Nearest Neighbor variations of the Vector and LSA models. We report the results obtained using the Mean Recipro...
Annotated Bibliography Content-Based Image Retrieval: Performance Evaluation and Semantic Scene Understanding
The general topic of my research, and thus also this annotated bibliography, is content-based image retrieval (CBIR). Within CBIR, I picked out two areas that seem crucial to me: performance evaluation of CBIR systems and semantic scene understanding or scene categorization as a pre-step towards CBIR based on the automated annotation of scenes. In Section 2, some overviews of work done in CBIR a...
Journal title:
- JASIST
Volume 63, Issue
Pages -
Publication date 2012